On March 13, Sesame unveiled its latest speech synthesis model, CSM, drawing significant industry attention. According to the official introduction, CSM uses an end-to-end, Transformer-based multimodal architecture that leverages conversational context to generate natural, emotionally rich, and strikingly realistic speech. The model supports real-time generation and accepts both text and audio as input. Users can also adjust parameters to control tone, intonation, rhythm, and emotion, giving it considerable flexibility. CSM is regarded as a breakthrough in AI speech technology.
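As a conceptual illustration only (Sesame has not published this exact interface; every name and parameter below is hypothetical), a client for a context-aware, controllable speech model might assemble a request like this:

```python
from dataclasses import dataclass, asdict

# Hypothetical prosody controls; CSM's actual parameter names are not
# documented in the announcement, so these are illustrative stand-ins.
@dataclass
class ProsodyControls:
    tone: str = "neutral"       # overall vocal tone
    intonation: float = 0.5     # 0.0 = flat, 1.0 = highly expressive
    rhythm: float = 1.0         # speaking-rate multiplier
    emotion: str = "calm"       # target emotional coloring

def build_request(text: str, context: list[str],
                  controls: ProsodyControls) -> dict:
    """Assemble a synthesis request: the model conditions on prior
    conversation turns (context) plus the new text and control settings."""
    return {
        "text": text,
        "context": context,           # earlier turns the model can attend to
        "controls": asdict(controls),
        "stream": True,               # real-time, chunked audio output
    }

req = build_request(
    "Glad that worked!",
    context=["User: Did the fix help?"],
    controls=ProsodyControls(emotion="cheerful", rhythm=1.1),
)
print(req["controls"]["emotion"])  # → cheerful
```

The sketch only captures the shape of the idea reported above: the model sees prior turns, not just the sentence to be spoken, and prosody is steered through explicit parameters rather than re-recorded prompts.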